107 research outputs found

    Cascade Learning by Optimally Partitioning

    Full text link
    Cascaded AdaBoost classifier is a well-known efficient object detection algorithm. The cascade structure has many parameters to be determined. Most existing cascade learning algorithms are designed by assigning a detection rate and a false positive rate to each stage, either dynamically or statically. Their objective functions are not directly related to minimizing computation cost, so these algorithms are not guaranteed to find an optimal solution in the sense of minimum computation cost. On the assumption that a strong classifier is given, in this paper we propose an optimal cascade learning algorithm (we call it iCascade) which iteratively partitions the strong classifier into two parts until a predefined number of stages is generated. iCascade searches for the optimal number r_i of weak classifiers in each stage i by directly minimizing the computation cost of the cascade. Theorems are provided to guarantee the existence of a unique optimal solution, and further theorems justify the proposed efficient algorithm for searching the optimal parameters r_i. Once a new stage is added, the parameter r_i of each stage decreases gradually as the iteration proceeds, which we call the decreasing phenomenon. Moreover, with the goal of minimizing computation cost, we develop an effective algorithm for setting the optimal threshold of each stage classifier. In addition, we prove in theory why a newly added stage requires more weak classifiers than the previous one. Experimental results on face detection demonstrate the effectiveness and efficiency of the proposed algorithm. Comment: 17 pages, 20 figures
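
    The cost-driven split can be illustrated with a toy search: given a strong classifier of T weak classifiers and an estimated prefix rejection rate, pick the split point that minimizes expected evaluations. This is a minimal sketch of the idea, not the authors' iCascade; the rejection-rate curve below is a made-up placeholder.

```python
import numpy as np

def optimal_split(T, reject_prob):
    """Toy version of the cost-minimizing split behind a two-stage cascade
    (a sketch, not the paper's algorithm). reject_prob[r] is the estimated
    probability that a window is rejected by the first r weak classifiers."""
    costs = []
    for r in range(1, T):
        # Stage 1 always evaluates r weak classifiers; the remaining T - r
        # are evaluated only by windows that survive stage 1.
        costs.append(r + (1.0 - reject_prob[r]) * (T - r))
    return int(np.argmin(costs)) + 1  # optimal r

# Hypothetical rejection curve that saturates quickly, favoring a short stage 1.
T = 100
reject_prob = 1.0 - np.exp(-0.1 * np.arange(T + 1))
print(optimal_split(T, reject_prob))
```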

    Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry

    Full text link
    The discrimination and simplicity of features are very important for effective and efficient pedestrian detection. However, most state-of-the-art methods are unable to achieve a good tradeoff between accuracy and efficiency. Inspired by some simple inherent attributes of pedestrians (i.e., appearance constancy and shape symmetry), we propose two new types of non-neighboring features (NNF): side-inner difference features (SIDF) and symmetrical similarity features (SSF). SIDF can characterize the difference between the background and the pedestrian as well as the difference between the pedestrian contour and its inner part, and SSF can capture the symmetrical similarity of pedestrian shape; it is difficult for neighboring features to achieve such characterization abilities. Finally, we propose to combine both non-neighboring and neighboring features for pedestrian detection, and we find that non-neighboring features can further decrease the average miss rate by 4.44%. Experimental results on the INRIA and Caltech pedestrian datasets demonstrate the effectiveness and efficiency of the proposed method. Compared to state-of-the-art methods without CNN, our method achieves the best detection performance on Caltech, outperforming the second best method (i.e., Checkerboards) by 1.63%. Comment: 9 pages, 17 figures
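
    The two feature types reduce to comparisons between non-adjacent rectangular patches of a channel image. A minimal sketch (the patch coordinates, the mean-pooling choice, and the negative-absolute-difference similarity are all assumptions for illustration):

```python
import numpy as np

def patch_mean(channel, y, x, h, w):
    """Mean value of an h x w patch at (y, x) in a single channel image."""
    return channel[y:y + h, x:x + w].mean()

def sidf(channel, inner, side):
    """Side-inner difference feature: contrast between two non-neighboring
    patches, each given as a (y, x, h, w) tuple."""
    return patch_mean(channel, *inner) - patch_mean(channel, *side)

def ssf(channel, y, x, h, w):
    """Symmetrical similarity feature: compare a patch with its mirror about
    the vertical center axis of the detection window (similarity modeled
    here as negative absolute difference, an assumption)."""
    mirror_x = channel.shape[1] - x - w
    return -abs(patch_mean(channel, y, x, h, w)
                - patch_mean(channel, y, mirror_x, h, w))

window = np.random.default_rng(0).random((128, 64))   # toy detection window
print(sidf(window, (40, 24, 16, 16), (40, 0, 16, 8)), ssf(window, 40, 8, 16, 16))
```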

    Learning Multilayer Channel Features for Pedestrian Detection

    Full text link
    Pedestrian detection based on the combination of a Convolutional Neural Network (CNN) and traditional handcrafted features (i.e., HOG+LUV) has achieved great success. Generally, HOG+LUV are used to generate candidate proposals and a CNN then classifies these proposals. Despite its success, there is still room for improvement: the CNN classifies proposals using only the fully-connected layer features, while the proposal scores and the features in the inner layers of the CNN are ignored. In this paper, we propose a unifying framework called Multilayer Channel Features (MCF) to overcome this drawback. It first integrates HOG+LUV with each layer of the CNN into multi-layer image channels. Based on these channels, a multi-stage cascade AdaBoost is then learned, where the weak classifiers in each stage are learned from the image channels of the corresponding layer. With more abundant features, MCF achieves the state of the art on the Caltech pedestrian dataset (a 10.40% miss rate). Using new and accurate annotations, MCF achieves a 7.98% miss rate. As many non-pedestrian detection windows can be quickly rejected by the first few stages, detection is accelerated by 1.43 times; by further eliminating highly overlapped detection windows with lower scores after the first stage, it is 4.07 times faster with negligible performance loss.
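
    A minimal sketch of a per-layer cascade in the spirit of MCF (the feature matrices, the stage rejection threshold, and the boosting configuration are all placeholder assumptions; sklearn's AdaBoostClassifier, version 1.2 or later, stands in for the paper's boosted stage classifiers):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_layer_cascade(layer_features, labels, reject_thresh=0.0):
    """Stage k is boosted on features from layer k (HOG+LUV first, then CNN
    layers); windows scoring below the threshold are rejected early and never
    reach later, more expensive stages."""
    stages, alive = [], np.ones(len(labels), dtype=bool)
    for X in layer_features:                     # one feature matrix per layer
        clf = AdaBoostClassifier(                # sklearn >= 1.2 API
            estimator=DecisionTreeClassifier(max_depth=2), n_estimators=64)
        clf.fit(X[alive], labels[alive])
        stages.append(clf)
        alive &= clf.decision_function(X) > reject_thresh
    return stages

# Toy run: later "layers" carry slightly stronger (synthetic) signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
layers = [rng.standard_normal((400, 32)) + y[:, None] * (k + 1) * 0.2
          for k in range(3)]
stages = train_layer_cascade(layers, y)
```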

    Learning Sampling Distributions for Efficient Object Detection

    Full text link
    Object detection is an important task in computer vision and learning systems. Multistage particle windows (MPW), proposed by Gualdi et al., is a fast and accurate object detection algorithm. By sampling particle windows from a proposal distribution (PD), MPW avoids exhaustively scanning the image. Despite its success, it is unknown how to determine the number of stages and the number of particle windows in each stage; moreover, MPW has to generate too many particle windows in the initialization step and redraws unnecessarily many particle windows around object-like regions. In this paper, we attempt to solve these problems. An important fact we use is that a randomly generated particle window is very unlikely to contain an object, because objects are sparse events relative to the huge number of candidate windows. Therefore, we design the proposal distribution so as to efficiently reject the huge number of non-object windows. Specifically, we propose the concepts of rejection, acceptance, and ambiguity windows and regions. This contrasts with MPW, which utilizes only one region of support. The PD of MPW is acceptance-oriented, whereas the PD of our method (called iPW) is rejection-oriented. Experimental results on human and face detection demonstrate the efficiency and effectiveness of the iPW algorithm. The source code is publicly accessible. Comment: 14 pages, 13 figures
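
    The rejection-oriented idea can be sketched in a few lines (this is an illustration under assumed thresholds, not the published iPW procedure; score_fn, t_rej, and t_acc are hypothetical):

```python
import numpy as np

def rejection_oriented_sampling(score_fn, img_w, img_h, win=64,
                                n_windows=200, rounds=3,
                                t_rej=-1.0, t_acc=1.0, seed=0):
    """Windows scoring below t_rej mark rejection regions that later rounds
    skip; windows above t_acc are accepted; the rest remain ambiguous and
    are simply resampled in the next round."""
    rng = np.random.default_rng(seed)
    accepted, rejected = [], []
    for _ in range(rounds):
        xs = rng.integers(0, img_w - win, n_windows)
        ys = rng.integers(0, img_h - win, n_windows)
        for x, y in zip(xs, ys):
            if any(abs(x - rx) < win and abs(y - ry) < win
                   for rx, ry in rejected):
                continue                      # inside a known rejection region
            s = score_fn(x, y)
            if s <= t_rej:
                rejected.append((x, y))       # grow the rejection regions
            elif s >= t_acc:
                accepted.append((x, y))       # confident detection
    return accepted

# Toy score: one "object" near (100, 100) in a 640x480 image.
score = lambda x, y: 2.0 - 0.02 * (abs(x - 100) + abs(y - 100))
print(len(rejection_oriented_sampling(score, 640, 480)))
```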

    Simultaneously Learning Neighborship and Projection Matrix for Supervised Dimensionality Reduction

    Full text link
    Explicitly or implicitly, most dimensionality reduction methods need to determine which samples are neighbors and what the similarity between those neighbors is in the original high-dimensional space. The projection matrix is then learned on the assumption that this neighborhood information (e.g., the similarity) is known and fixed prior to learning. However, it is difficult to precisely measure the intrinsic similarity of samples in high-dimensional space because of the curse of dimensionality. Consequently, the neighbors selected according to such similarity may be unreliable, and the projection matrix obtained from such similarity and neighbors is not optimal in the sense of classification and generalization. To overcome these drawbacks, in this paper we propose to treat the similarity and the neighbors as variables and to model them in the low-dimensional space. Both the optimal similarity and the projection matrix are obtained by minimizing a unified objective function, with nonnegative and sum-to-one constraints on the similarity. Instead of empirically setting the regularization parameter, we treat it as a variable to be optimized; interestingly, the optimal regularization parameter adapts to the neighbors in the low-dimensional space and has an intuitive meaning. Experimental results on the YALE B, COIL-100, and MNIST datasets demonstrate the effectiveness of the proposed method.
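
    One common way to realize such a formulation is to alternate between the similarity and the projection. The sketch below is an illustrative stand-in, not the paper's solver: it projects each similarity row onto the probability simplex (which enforces the nonnegative, sum-to-one constraints) and updates the projection from a graph-Laplacian eigenproblem; gamma plays the role of the regularization parameter and is fixed here for simplicity.

```python
import numpy as np

def simplex_project(v):
    """Euclidean projection onto the probability simplex (Duchi et al.)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def alternate_similarity_projection(X, d, iters=10, gamma=1.0, seed=0):
    """X: n x D data, d: target dimension. Alternates between (a) similarity
    rows computed from distances in the projected space and (b) a projection
    from the smallest eigenvectors of X^T L X (orthonormal W is a
    simplifying assumption here)."""
    n, D = X.shape
    W = np.linalg.qr(np.random.default_rng(seed).standard_normal((D, d)))[0]
    for _ in range(iters):
        Z = X @ W
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        S = np.zeros((n, n))
        for i in range(n):                      # row-wise simplex projection
            idx = np.arange(n) != i             # no self-neighborship
            S[i, idx] = simplex_project(-d2[i, idx] / (2.0 * gamma))
        A = (S + S.T) / 2.0                     # symmetrized similarity
        L = np.diag(A.sum(1)) - A               # graph Laplacian
        W = np.linalg.eigh(X.T @ L @ X)[1][:, :d]  # smallest-eigenvalue subspace
    return W, S
```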

    Stacked Semantic-Guided Attention Model for Fine-Grained Zero-Shot Learning

    Full text link
    Zero-Shot Learning (ZSL) is achieved by aligning the semantic relationships between the global image feature vector and the corresponding class semantic descriptions. However, using global features to represent fine-grained images may lead to sub-optimal results, since they neglect the discriminative differences of local regions. Besides, different regions contain distinct discriminative information, and the important regions should contribute more to the prediction. To this end, we propose a novel stacked semantics-guided attention (S2GA) model that obtains semantically relevant features by using individual class semantic features to progressively guide the visual features and generate an attention map that weights the importance of different local regions. By feeding both the integrated visual features and the class semantic features into a multi-class classification architecture, the proposed framework can be trained end-to-end. Extensive experimental results on the CUB and NABird datasets show that the proposed approach yields consistent improvements on both fine-grained zero-shot classification and retrieval tasks.
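
    The core computation of one attention hop is a softmax over region scores guided by the class semantics, followed by weighted pooling. Below is a minimal numpy sketch under assumed shapes and randomly initialized weights (W_v, W_s, and the residual stacking rule are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def semantic_guided_attention(regions, semantic, W_v, W_s):
    """regions: (R, Dv) local features; semantic: (Ds,) class description.
    Scores each region against the semantics, then pools with the map."""
    scores = (regions @ W_v) @ (W_s @ semantic)   # (R,) region relevance
    alpha = softmax(scores)                       # attention map over regions
    return alpha @ regions, alpha                 # (Dv,) attended feature

def stacked_attention(regions, semantic, layers):
    """Stack several hops; each hop refines region features with the attended
    global vector (one common stacking scheme, assumed here)."""
    feats, g = regions, None
    for W_v, W_s in layers:
        g, _ = semantic_guided_attention(feats, semantic, W_v, W_s)
        feats = feats + g                         # broadcast residual guidance
    return g

rng = np.random.default_rng(0)
R, Dv, Ds, k = 49, 512, 312, 64                   # e.g., 7x7 regions, attributes
layers = [(rng.standard_normal((Dv, k)) * 0.01,
           rng.standard_normal((k, Ds)) * 0.01) for _ in range(2)]
print(stacked_attention(rng.standard_normal((R, Dv)),
                        rng.standard_normal(Ds), layers).shape)
```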

    Video Summarization with Attention-Based Encoder-Decoder Networks

    Full text link
    This paper addresses the problem of supervised video summarization by formulating it as a sequence-to-sequence learning problem, where the input is a sequence of original video frames and the output is a keyshot sequence. Our key idea is to learn a deep summarization network with an attention mechanism that mimics the way humans select keyshots. To this end, we propose a novel video summarization framework named Attentive encoder-decoder networks for Video Summarization (AVS), in which the encoder uses a Bidirectional Long Short-Term Memory (BiLSTM) to encode the contextual information among the input video frames. For the decoder, two attention-based LSTM networks are explored, using additive and multiplicative objective functions, respectively. Extensive experiments are conducted on two video summarization benchmark datasets, i.e., SumMe and TVSum. The results demonstrate the superiority of the proposed AVS-based approaches over the state-of-the-art approaches, with remarkable improvements of 0.8% to 3% on the two datasets, respectively. Comment: 9 pages, 7 figures
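
    The additive/multiplicative distinction corresponds to the two standard attention scoring functions. A minimal numpy sketch contrasting the two (the weight shapes are assumptions):

```python
import numpy as np

def multiplicative_score(h_dec, H_enc, W):
    """Multiplicative (bilinear) attention: s_t = h_enc_t^T W h_dec.
    H_enc: (T, D) encoder states; h_dec: (D,); W: (D, D)."""
    return H_enc @ (W @ h_dec)                         # (T,)

def additive_score(h_dec, H_enc, W1, W2, v):
    """Additive attention: s_t = v^T tanh(W1 h_enc_t + W2 h_dec).
    W1, W2: (k, D); v: (k,)."""
    return np.tanh(H_enc @ W1.T + W2 @ h_dec) @ v      # (T,)

def context(scores, H_enc):
    a = np.exp(scores - scores.max()); a /= a.sum()    # softmax over frames
    return a @ H_enc                                   # attended context (D,)

rng = np.random.default_rng(0)
T, D, k = 30, 256, 128
H, h = rng.standard_normal((T, D)), rng.standard_normal(D)
c_mul = context(multiplicative_score(h, H, rng.standard_normal((D, D)) / D), H)
c_add = context(additive_score(h, H, rng.standard_normal((k, D)) / D,
                               rng.standard_normal((k, D)) / D,
                               rng.standard_normal(k)), H)
print(c_mul.shape, c_add.shape)
```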

    Query-Aware Sparse Coding for Multi-Video Summarization

    Full text link
    Given the explosive growth of online videos, it is becoming increasingly important to relieve the tedious work of browsing and managing video content of interest. Video summarization aims to provide such a technique by transforming one or multiple videos into a compact one. However, conventional multi-video summarization methods often fail to produce satisfying results because they ignore the user's search intent. To this end, this paper proposes a novel query-aware approach that formulates multi-video summarization in a sparse coding framework, where the web images retrieved by the query are taken as important preference information revealing the query intent. To provide a user-friendly summarization, this paper also develops an event-keyframe presentation structure that presents keyframes in groups of query-related events by using an unsupervised multi-graph fusion method. We release a new public dataset named MVS1K, which contains about 1,000 videos from 10 queries together with their video tags, manual annotations, and associated web images. Extensive experiments on the MVS1K dataset validate that our approaches produce superior objective and subjective results compared with several recently proposed approaches. Comment: 10 pages, 8 figures
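
    The sparse-coding view can be illustrated with an off-the-shelf Lasso solver: candidate frames form the dictionary, and adding query web-image features to the reconstruction targets biases selection toward query-relevant frames. This is a simplified stand-in for the paper's objective (the lam and k values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

def query_aware_keyframes(frames, web_imgs, lam=0.05, k=5):
    """frames: (N, D) candidate-frame features; web_imgs: (M, D) features of
    web images retrieved for the query. Each target is sparsely reconstructed
    from the frames; frames with the largest accumulated coefficients become
    keyframes."""
    weight = np.zeros(len(frames))
    for target in np.vstack([frames, web_imgs]):   # query images bias selection
        model = Lasso(alpha=lam, max_iter=5000).fit(frames.T, target)
        weight += np.abs(model.coef_)
    return np.argsort(weight)[::-1][:k]            # indices of selected frames

rng = np.random.default_rng(0)
print(query_aware_keyframes(rng.random((40, 64)), rng.random((8, 64))))
```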

    Cascaded Subpatch Networks for Effective CNNs

    Full text link
    Conventional Convolutional Neural Networks (CNNs) use either a linear or non-linear filter to extract features from an image patch (region) of spatial size H×W (typically, H is small and equal to W, e.g., 5 or 7). Generally, the size of the filter equals the size H×W of the input patch. We argue that the representation ability of this equal-size strategy is not strong enough. To overcome this drawback, we propose to use a subpatch filter whose spatial size h×w is smaller than H×W. The proposed subpatch filter consists of two subsequent filters: the first is a linear filter of spatial size h×w aimed at extracting features from the spatial domain, and the second is of spatial size 1×1 and is used to strengthen the connections between different input feature channels and to reduce the number of parameters. The subpatch filter convolves with the input patch, and the resulting network is called a subpatch network. Taking the output of one subpatch network as input, we repeat constructing subpatch networks until the output contains only one neuron in the spatial domain. These subpatch networks form a new network called a Cascaded Subpatch Network (CSNet), and the feature layer generated by CSNet is called a csconv layer. For the whole input image, we construct a deep neural network by stacking a sequence of csconv layers. Experimental results on four benchmark datasets demonstrate the effectiveness and compactness of the proposed CSNet. For example, our CSNet reaches a test error of 5.68% on the CIFAR10 dataset without model averaging; to the best of our knowledge, this is the best result ever obtained on the CIFAR10 dataset.
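
    A subpatch filter is thus an h×w convolution followed by a 1×1 channel-mixing convolution. A minimal numpy sketch with random placeholder weights (the ReLU after each filter is an assumption):

```python
import numpy as np

def conv2d(x, w):
    """Valid convolution. x: (C_in, H, W); w: (C_out, C_in, kh, kw)."""
    Co, Ci, kh, kw = w.shape
    H, W = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((Co, H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.tensordot(w, x[:, i:i + kh, j:j + kw], axes=3)
    return out

def subpatch_filter(x, w_hw, w_11):
    """An h x w spatial filter followed by a 1 x 1 filter mixing channels."""
    return np.maximum(conv2d(np.maximum(conv2d(x, w_hw), 0.0), w_11), 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 7, 7))                 # one 7x7 input patch
w_hw = rng.standard_normal((8, 3, 3, 3)) * 0.1     # 3x3 subpatch filter
w_11 = rng.standard_normal((8, 8, 1, 1)) * 0.1     # 1x1 channel mixing
print(subpatch_filter(x, w_hw, w_11).shape)        # (8, 5, 5); repeat to 1x1
```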

    Transductive Zero-Shot Learning with Adaptive Structural Embedding

    Full text link
    Zero-shot learning (ZSL) endows a computer vision system with the inferential capability to recognize instances of a new category that it has never seen before. Its two fundamental challenges are visual-semantic embedding and domain adaptation, arising in the cross-modality learning and unseen-class prediction steps, respectively. To address both challenges, this paper presents two corresponding methods named Adaptive STructural Embedding (ASTE) and Self-PAced Selective Strategy (SPASS). Specifically, ASTE formulates the visual-semantic interactions in a latent structural SVM framework that adaptively adjusts the slack variables to embody the different reliability of training instances: reliable instances incur small punishments, whereas less reliable instances incur more severe punishments, ensuring a more discriminative embedding. On the other hand, SPASS offers a framework to alleviate the domain shift problem in ZSL by exploiting the unseen data in an easy-to-hard fashion. In particular, SPASS borrows the idea of self-paced learning, iteratively selecting unseen instances from reliable to less reliable so as to gradually adapt the knowledge from the seen domain to the unseen domain. By combining SPASS and ASTE, we then present a self-paced Transductive ASTE (TASTE) method that progressively reinforces the classification capacity. Extensive experiments on three benchmark datasets (i.e., AwA, CUB, and aPY) demonstrate the superiority of ASTE and TASTE. Furthermore, we also propose a fast training (FT) strategy to improve the efficiency of most existing ZSL methods; the FT strategy is surprisingly simple and general, and can speed up the training of most existing methods by 4 to 300 times while retaining their previous performance.
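
    The easy-to-hard selection can be sketched generically (this is an illustration of a self-paced loop, not SPASS itself; predict_fn and refit_fn are hypothetical callables):

```python
import numpy as np

def self_paced_transduction(predict_fn, refit_fn, X_unseen, rounds=5):
    """Each round pseudo-labels the unseen data, keeps a growing fraction of
    the most confident instances (reliable first), and refits on them, so
    knowledge transfers from the seen to the unseen domain gradually."""
    model = None
    for r in range(1, rounds + 1):
        labels, confidence = predict_fn(X_unseen, model)
        k = int(len(X_unseen) * r / rounds)      # easy-to-hard pace schedule
        keep = np.argsort(confidence)[::-1][:k]  # most reliable instances
        model = refit_fn(X_unseen[keep], labels[keep])
    return model
```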